ABSTRACT


With the advances in VLSI technology, it will be possible to
fabricate chips with 100,000 to 500,000 gates per chip. However, the
technology to pack more and more elements on a chip has outpaced the
collective knowledge for effective use of chip "real estate". For
example, it is virtually impossible to test high density microcircuits.

This report reviews the existing literature on VLSI technology with
regard to proposed methods to increase reliability and testability.

One of the critical problems of high density microcircuits is the
limited number of I/O pins. The present literature points out two
types of circuit additions, fault masking and self-checking, that can
improve circuit reliability.

The report also provides a list of references for further study
of Fault-Tolerant Computing.

INTRODUCTION


Contemporary integrated circuits contain as many components as the
largest computing systems of 15 to 20 years ago. The 1960's were
the decade of gate level design, the 1970's were the decade of
register-transfer-level design and the 1980's will be the decade of
processor-memory-switch (PMS) design. The age of VLSI is here and its
technology is presenting interesting potentials as well as challenges.
The advantages of VLSI include reduction in support cost, improved
reliability and improved fault detection. Some of the challenging
issues include partitioning, fault models and dependencies, efficient
use of redundancy, role of reliability tools, hierarchical complexity,
self-test during operation, redundancy to enhance yield, and self-test
at fabrication time.

As is the case in most design efforts, three primary factors
were considered and traded off against each other in the design of
computer systems: cost, performance and fault tolerance. In the past,
systems engineers had realized that any attempt to improve significantly
any one of these factors while holding another constant meant significant
degradation in the third factor. The advent of VLSI, however, seems
to have affected that situation dramatically. With VLSI, the economics
are different and the rate of increase in cost as a function of added
gates is greatly reduced. Thus, if a system designer wants to
increase fault tolerance by adding circuitry while holding performance
constant (or perhaps even increasing performance), the net increase in
manufacturing cost of the machine will be very small in comparison to
costs for conventional designs of the past. As the era of VLSI and
VHSI circuits emerges, a single integrated chip will provide the
electronic functions of former systems. This factor tends to improve
physical design economics, improve performance and, on the surface,
improve reliability and system availability. But problems such as
initial yield in chip fabrication, increased complexity and cost of
testing dramatically erode the economic advantages of increased circuit
densities. In particular, the technology to pack more and more active
elements on a chip has outpaced the collective knowledge for systematic
and effective use of chip "real estate". For example, it is virtually
impossible to test high density microcircuits in the laboratory, and
certainly not in their operational environment.

Phase I of this research is to review the literature and investigate
the technological and economic feasibility of functional sub-circuit
partitioning, which elevates redundancy to the sub-circuit level (and
beyond) to include higher levels of fault detection and fault isolation
capability on a single chip. The purpose of the literature review is
to identify the key factors in the design stage of electronic
circuitry that most strongly affect end-user utility, positively or
negatively. Another facet of the literature review is to acquaint
the researchers with the immense literature base for electronic
technology applicable to military avionics.

LITERATURE REVIEW


In the past, the design of fault-tolerant computing systems has
been done in an ad hoc manner. The absence of a unified theory of
fault-tolerant computing can be attributed to at least two factors:

1. The high cost of hardware, which in the past has limited the
use of redundancy techniques.

2. A lack of understanding of the basic definitions and goals of
fault-tolerant computing.

Avizienis defined a fault-tolerant computer as one which is free
from hardware and software design faults and can execute its programs
correctly, obtaining correct results within specified time limits,
despite the presence of transient or permanent operational faults.

The first special issue on Fault-Tolerant Computing was published
in the IEEE Transactions on Computers nearly a decade ago. Six
other special issues devoted to this same topic (5,6,7,8,9,10) contain
a representative sample of the research activities that have taken
place in fault-tolerant computing over the past decade.

The most obvious trend is the increasing concern with the
effects of large scale integration on fault-tolerance techniques that
were effective for computers implemented with SSI or MSI circuits.
Different failure modes may be anticipated as the scale of integration
increases; testing and diagnostic procedures that were appropriate
ten years ago may be totally inadequate for the much more highly
integrated circuitry of today.

Fault-tolerance has now come to be recognized as a desirable and
in some cases an essential feature of a wide range of computing systems.
Interest in computers capable of long maintenance-free operation has
been matched by interest in low maintenance or scheduled maintenance
commercial systems, in-flight control avionic systems that provide
extremely high availability, and process control and telephone switching
systems. Also, as more and more memory cells are packed into a single
chip, the number of failure modes increases and the need for efficient
algorithms to detect faults in them becomes more critical.

The formulations of the concepts of self-checking logic (11) and hybrid
redundancy appear to be two important steps toward a general theory
of fault-tolerant computing. Hybrid redundancy bridges the gap between
static and dynamic redundancy schemes, while self-checking design enables
us to distribute the monitoring function among the sub-systems. Sedmak
and Liebergot, in describing a design approach for fault-tolerant
general purpose computers implemented with VLSI, noted that there are
significant problems in using some conventional fault-tolerant
techniques in VLSI implementations. Their approach is to implement
all the logic needed to detect faults in a VLSI chip directly on the
chip itself and to design and partition this logic so as to minimize
the possibility that any failure mode is capable both of causing a chip
to malfunction and of simultaneously making it incapable of reporting
this fact.
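
As an illustration only, the idea of hybrid redundancy described above
can be sketched in a few lines of Python. The sketch assumes three voted
modules (the static part) backed by switchable spares (the dynamic
part). The names HybridRedundantUnit and majority, and the stuck-at-zero
module, are hypothetical and are not taken from any of the cited papers.

```python
from collections import Counter


def majority(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


class HybridRedundantUnit:
    """Voted modules (static redundancy) plus spares (dynamic redundancy)."""

    def __init__(self, active, spares):
        self.active = list(active)  # modules currently feeding the voter
        self.spares = list(spares)  # idle replacements, assumed fault-free

    def compute(self, x):
        outputs = [module(x) for module in self.active]
        voted = majority(outputs)
        # Disagreement detector: retire any module that lost the vote
        # and switch in a spare, if one is still available.
        for i, out in enumerate(outputs):
            if out != voted and self.spares:
                self.active[i] = self.spares.pop(0)
        return voted


def good(x):
    return x + 1


def stuck(x):
    return 0  # hypothetical module with its output stuck at zero


unit = HybridRedundantUnit([good, stuck, good], spares=[good])
result = unit.compute(5)
```

In this sketch the voter masks the faulty module on the first call, and
the disagreement detector then replaces it, so later calls run on three
healthy modules again.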

c. Rapid fault isolation without human intervention will result
in low downtime.

The fault-tolerant aspects of the VLSI circuits are fault detection,
isolation and recovery. By the mid 1980's it is expected that speeds
for military avionics circuits will approach 10^[illegible] operations
per second, the density of circuits will increase, and the testing
methods will become expensive and complex, if not impossible. Since
circuit densities in the range of 10^5 to 5 x 10^5 gates per chip are
within the reach of VLSI technologies today, testability and reliability
will become dominant issues. There is a large amount of literature on
fault-tolerant computing, beginning with Von Neumann and Moore and
Shannon. Of the recent literature on fault-tolerant computing, a large
number of papers have focused on fault-tolerant memory. Reviews of
semiconductor memory have been given by Eimbinder, Riley (18) and
Lenke et al.

Several important past studies are pertinent to the background of
this research. Agarwal addresses the problem of detecting faults in
programmable logic arrays (PLAs). He points out that such devices are
vulnerable to a unique class of contact faults and he develops a PLA
model wherein these faults can be represented. Crouzet and Landrault
contend that an LSI circuit can be made self-checking if it is designed
from the outset with that goal in mind and if its various possible
failure modes are well understood. Satish Thatte of TTI proposes
several methods for improving VLSI testability. He proposes to
utilize self-checking, on-chip testing, partitioning, exhaustive
checking and microdiagnostics.

Lee, Ghani and Heron suggest the use of a recovery cache to
take care of faults due to software. The function of the recovery
cache is to store all data that would be needed to restore a machine to
the state it was in prior to the execution of any defective software
module.
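
The recovery cache idea described above can be illustrated with a
minimal software sketch. It assumes memory is modelled as a Python
dictionary and that the cache records the prior value of each location
the first time a module writes to it. The class and method names
(RecoveryCache, begin, write, commit, rollback) are hypothetical and do
not describe the hardware of Lee, Ghani and Heron.

```python
_ABSENT = object()  # marker for "this address held no value before"


class RecoveryCache:
    """Record prior values so a failed module's writes can be undone."""

    def __init__(self, memory):
        self.memory = memory  # dict: address -> value
        self.saved = None     # dict: address -> prior value, while active

    def begin(self):
        """Start protecting writes (entry to a possibly faulty module)."""
        self.saved = {}

    def write(self, addr, value):
        # Only the first write to an address needs its old value saved.
        if self.saved is not None and addr not in self.saved:
            self.saved[addr] = self.memory.get(addr, _ABSENT)
        self.memory[addr] = value

    def commit(self):
        """The module finished correctly: discard the saved values."""
        self.saved = None

    def rollback(self):
        """The module failed: restore memory to its state at begin()."""
        if self.saved is None:
            return
        for addr, old in self.saved.items():
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.saved = None


memory = {"a": 1, "b": 2}
cache = RecoveryCache(memory)
cache.begin()
cache.write("a", 10)
cache.write("a", 11)  # second write: prior value already saved
cache.write("c", 3)   # new location: removed again on rollback
cache.rollback()
```

After the rollback in this example, memory again holds only the values
it had before the module started.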

A large amount of recent work has dealt with reconfiguration
(manual and automatic) to improve system performance. Mathur and de
Sousa presented a technique which uses configurable NMR. Papers by
Cox and Carroll and by Hartwell et al. describe approaches which
involve swapping memory bit planes. Srinivasan's approach combines
triple modular redundancy (TMR) with matching the address decoder
to the particular faulty array. Mehta et al. also addressed the issue
of internal redundancy, while Carter and Schneider (12) discussed the
basic principles of self-checking circuits.

Arnold showed the importance of having fault detection and
recovery for all the elements of a machine in order to achieve high
reliability. Tanaka et al. described the use of duplication in
some parts of a general purpose computer.




SUMMARY 


The critical problem with VLSI circuits is the limited
number of I/O pins. As the number of functions that can be done has
increased, the pinouts have remained basically the same. This presents
problems in that the chips have to be partitioned to make use of
the limited I/O pins. Likewise, the interconnection between chips has
become more difficult because of limited I/O pins.

Since "soft faults" can no longer be isolated by "hard fault"
error detection devices, self-testing methods must be used to improve
diagnostics, and testing must move to the design level. Just as systems
are currently partitioned into sub-systems, so can VLSI chips be
partitioned into subcircuits. Critical subcircuits may be duplicated
so that redundant paths can be switched in automatically after errors
are detected.

The  device  density  gain  will  be  traded  off  for  circuit  reliability 
as  we  try  to  pack  more  and  more  elements  on  chips.  It  is  also 
envisioned  that  some  redundant  circuits  will  be  provided  to  allow 
testability  since  high  density  circuits  make  it  impossible  to  test 
circuits  in  the  laboratory  and  in  the  field. 

Following are the main methods used to increase reliability and
testability. The first method is called fault masking. This is a
process of masking a fault without really knowing the nature or location
of the fault. Triple modular redundancy (TMR) and two rail networks
(TRN) are techniques used in fault masking. These designs have a high
redundancy ratio, i.e., the ratio of hardware in the redundant circuit
to that in a nonredundant circuit is 3:1 to 4:1. The second method,
called self-checking, is a process which detects possible error
conditions in the circuit and produces an error signal which can be used
to (1) stop computations, (2) signal manual repair and (3) initiate
reconfiguration of the circuit operation. The simplest form of this is
complete duplication. The redundancy ratio of this type of circuit is
greater than 2:1. Another type of self-checking circuit uses forced
parity, which involves adding hardware to the checking circuit which
possesses the ability to generate an odd parity error.
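
The difference between fault masking and self-checking described above
can be made concrete with a short sketch. It assumes module outputs are
non-negative integers treated as bit vectors. The function names and the
gate counts used for the redundancy ratio (a 1000-gate module, a 50-gate
voter and a 30-gate comparator) are hypothetical and are chosen only to
show how ratios slightly above 3:1 and 2:1 arise.

```python
def tmr_vote(a, b, c):
    """Fault masking: bitwise majority of three module outputs.

    A fault in any single module is masked without knowing which
    module failed or what kind of fault it was.
    """
    return (a & b) | (a & c) | (b & c)


def duplicate_and_compare(unit_a, unit_b, x):
    """Self-checking by complete duplication.

    Returns (result, error), where error is True if the two copies
    disagree. The fault is detected but not masked; the caller must
    stop, request repair or reconfigure.
    """
    result_a = unit_a(x)
    result_b = unit_b(x)
    return result_a, result_a != result_b


def redundancy_ratio(copies, module_gates, overhead_gates):
    """Hardware in the redundant design relative to one plain module."""
    return (copies * module_gates + overhead_gates) / module_gates


# One of three modules delivers a corrupted word; the vote masks it.
masked = tmr_vote(0b1011, 0b1011, 0b0011)

# One of two copies is faulty; the comparison raises the error signal.
value, error = duplicate_and_compare(lambda x: x * 2, lambda x: x * 2 + 1, 4)

tmr_ratio = redundancy_ratio(3, 1000, 50)  # three copies plus a voter
dup_ratio = redundancy_ratio(2, 1000, 30)  # two copies plus a comparator
```

Under these assumed gate counts, the TMR design costs a little more than
three times the plain module and the duplicated design a little more
than twice, which agrees with the ratios quoted in the text.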

Another design technique that is being extensively used by IBM
is called Level-Sensitive Scan Design (LSSD). The basic premise of
this technique is that no feedback loops can be used in the chip.
All loops must be broken by inserting serial shift register latches.
The latches are then brought out to pins, thereby allowing access to
previously unreachable nodes within the chip. The shift register pin
also allows a test pattern to be entered into the chip register.

From the work done so far, it is apparent that additional searching
of the literature is necessary. A bibliography at the end of the report
offers additional helpful material that has not yet been studied by the
researchers. This list will include additional literature that is being
generated on this topic by various research organizations.
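
The scan design technique described earlier in this section can be
illustrated with a small behavioural model. It is a sketch only: it
assumes a single chain of latches with one scan-in and one scan-out pin,
and a purely combinational logic block between latch outputs and latch
inputs. The ScanChain class and the example logic functions are
hypothetical and do not model IBM's actual latch design or clocking.

```python
class ScanChain:
    """Latches that act as a serial shift register during test."""

    def __init__(self, n_latches, logic):
        self.latches = [0] * n_latches
        self.logic = logic  # combinational block: list of bits -> list of bits

    def shift(self, scan_in):
        """One scan clock: take a bit at the head, emit the tail bit."""
        scan_out = self.latches[-1]
        self.latches = [scan_in] + self.latches[:-1]
        return scan_out

    def load(self, pattern):
        """Shift in a test pattern so pattern[i] ends up in latch i."""
        for bit in reversed(pattern):
            self.shift(bit)

    def capture(self):
        """One system clock: latches take the logic block's outputs."""
        self.latches = list(self.logic(self.latches))

    def unload(self):
        """Shift the captured response out through the scan-out pin."""
        out = [self.shift(0) for _ in range(len(self.latches))]
        return out[::-1]  # the tail latch is emitted first


def good_logic(s):
    return [s[0] ^ s[1], s[1] & s[2], s[2] ^ 1]


def faulty_logic(s):
    return [s[0] ^ s[1], s[1] & s[2], 0]  # third output stuck at zero


def run_test(logic, pattern):
    chain = ScanChain(len(pattern), logic)
    chain.load(pattern)
    chain.capture()
    return chain.unload()


expected = run_test(good_logic, [1, 1, 0])
observed = run_test(faulty_logic, [1, 1, 0])
fault_detected = observed != expected
```

Only the scan pins are used to set and observe the internal state, which
is the property that makes otherwise unreachable nodes testable.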

Equipment 

A Motorola M6809 Development System was purchased under Phase I
of the project. This development system will be used to conduct
tests on chip reliability and testability of various proposed designs.
This equipment will also be used to train graduate engineers who will
become well versed in VLSI/VHSI principles that will include
considerations of economics, scale, speed and circuit utility.